Predicting Vehicle MSRP using Multiple Regression Analysis

By: Connor Schultz and Aiden Bull

1. Introduction

Predictive analytics is an important branch of statistics and data mining used to predict the outcomes of future unknown events. When applied to price prediction, predictive analytics can substantially benefit both the consumers and the manufacturers of products: consumers by ensuring they get the best deal for a given product, and manufacturers by ensuring that the MSRP they set accurately reflects the market value of similar products. Since cars are generally quite expensive, predictive analytics is particularly useful for the decision making of both consumers and manufacturers. This study aims to use data preprocessing, encoding, attribute engineering, and multiple regression to accurately predict the MSRP of a new vehicle from its specifications. Various methods were tested in an attempt to improve accuracy and reduce computational complexity.

2. Data

The dataset used for our regression analysis was gathered from a Reddit post on the r/datasets subreddit [1]. The dataset was posted on March 28th, 2019, by user u/nicolas-gervais. It contains 32,316 car instances and was scraped using Python libraries from The Car Connection's website [2]. The dataset also contains 235 car specifications, which, paired with the large number of instances, made it suitable for a multiple regression analysis. A histogram of the MSRP distribution is shown in Section 4.

3. Analysis Methods

The aim of our regression analyses was to predict the MSRP values of car instances from their specifications. Before regression could be performed, the initial dataset required heavy preprocessing. Once the dataset was processed, it was split into training and test sets. We then built regression models on the training sets using three types of regression: linear regression using least squares, ridge regression, and lasso regression. Afterwards, we used the models to predict the MSRP value of each instance in the test set. Finally, we computed error values for each prediction.

3.1. Data Preprocessing

The data required heavy preprocessing before it could be used for our purposes. First, many instances were missing values for certain attributes. The regression methods we used require numeric values for all attributes, so we had to decide how to handle these gaps. Another issue was that many attributes were quite obscure and had large ranges of possible values, each occurring at low frequency. Such attributes are not ideal for building a regression model that provides accurate predictions. As it turns out, the obscure attributes were often the ones with large numbers of missing values, and this overlap was used to solve both issues at once. The final issue was that many attributes were not numeric but categorical, stored as text. The major problem with these categorical attributes was that their values did not follow strict naming conventions: many values were stored under different names but had identical semantics. One example would be the values “4wd”, “four-wheel-drive”, “4 wheel drive”, and “Four-wd” belonging to the “Drivetrain” attribute. Each of these values refers to the same drivetrain, but they are treated as distinct because they have different representations. Fixing these issues and then encoding the categorical attributes as numeric ones accounted for the bulk of the preprocessing time.

To fix the issue of missing values, we had a choice between two solutions. The first was to assign values based on a fixed rule, such as setting each missing value to 0 or to the mean of the given attribute. This would eliminate the missing values but would add incorrect information to the dataset. For a given car instance with a missing attribute value, the true value may well exist, in which case assigning 0 would strip that feature from the car. Similarly, replacing the missing value with the attribute mean would be unlikely to match the true value exactly, and for a low-frequency, obscure attribute it could credit cars with features they do not have.
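
A minimal pandas sketch of the two rule-based options considered, using a hypothetical Horsepower attribute (the column name and values are assumptions for illustration):

import numpy as np
import pandas as pd
#Hypothetical attribute with missing entries
df_toy = pd.DataFrame({"Horsepower": [300.0, np.nan, 150.0, np.nan]})
#Option 1: replace missing values with 0
filled_zero = df_toy["Horsepower"].fillna(0)
#Option 2: replace missing values with the attribute mean
filled_mean = df_toy["Horsepower"].fillna(df_toy["Horsepower"].mean())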

Another solution to the missing-value problem is to remove the affected instances or attributes entirely. The drawback of this method is that it reduces the size and information richness of the dataset. However, as mentioned earlier, the obscure attributes were often the ones with missing values, so the attributes deemed too obscure to be valuable a priori were removed. This addressed the missing-value problem while also eliminating obscure attributes that did not contribute to the analysis. Afterwards, any remaining instances with missing values were removed, which reduced the dataset to 23,844 instances but fully resolved the missing-value problem.
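
A programmatic sketch of this cleanup is shown below; the 0.5 sparsity threshold is illustrative, since the obscure attributes were actually selected a priori:

import numpy as np
import pandas as pd
#Toy frame: "Obscure Spec" stands in for a sparsely populated attribute
df_toy = pd.DataFrame({"MSRP": [30000, 45000, 22000],
                       "Obscure Spec": [np.nan, np.nan, 1.0]})
#Drop attributes that are mostly missing, then drop remaining incomplete instances
missing_fraction = df_toy.isna().mean()
sparse_columns = missing_fraction[missing_fraction > 0.5].index
df_toy = df_toy.drop(columns=sparse_columns).dropna()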

Furthermore, in order to aggregate the low-frequency attribute values, values that were equivalent but encoded differently were combined into a single value using regular expressions and replace functions. For example, in the Engine Type attribute, the values gasv6 and regularunleadedv6 were aggregated together as gasv6.
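
A sketch of this normalization for the Drivetrain example; the regular expression is an illustrative approximation of the actual rules:

import pandas as pd
drivetrain = pd.Series(["4wd", "four-wheel-drive", "4 wheel drive", "Four-wd"])
#Collapse equivalent spellings onto one canonical value
drivetrain = drivetrain.str.lower().str.replace(
    r"^(4wd|four[- ]?wheel[- ]?drive|4 wheel drive|four[- ]?wd)$",
    "4wd", regex=True)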

After the data was aggregated, the categorical attributes were encoded as numeric values by applying a one-hot encoding scheme. For example, if a categorical attribute contained four distinct values, the encoder would create four new binary attributes, one per value, with exactly one of the four set for any given instance. One encoded attribute from each categorical attribute was then removed, reducing the number of encoded attributes per categorical attribute from n to n-1, where n is the number of distinct values. This was done to combat the collinearity issue that arises when encoding categorical variables with a one-hot scheme. The limitation of this approach is that it greatly increased the number of attributes in the dataset, from 235 to 1,756.

To combat this, a second dataset was created in which certain attributes were decomposed into numeric attributes instead of being encoded. The two greatest contributors to the increased number of attributes were the Engine Type and Transmission attributes. First, the Engine Type attribute was converted to a numeric attribute representing the number of cylinders in the car's engine; for example, gasv6 was converted to 6. Next, the Transmission attribute was decomposed into two binary attributes indicating whether the car has a manual or an automatic transmission and one numeric attribute representing the number of gears; for example, 6speedmanual was converted to 6 in the number-of-gears attribute, 1 in the manual transmission attribute, and 0 in the automatic transmission attribute. The same one-hot encoding method described above was then applied to the remaining categorical attributes. This greatly reduced the number of attributes compared to the fully encoded dataset, with the final count being 64. A sketch of both techniques follows.
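
The sketch below illustrates the one-hot encoding with a dropped level and the Transmission decomposition; the column names and parsing pattern are assumptions for illustration:

import pandas as pd
df_toy = pd.DataFrame({"Drivetrain": ["4wd", "fwd", "rwd"],
                       "Transmission": ["6speedmanual", "8speedautomatic", "6speedautomatic"]})
#One-hot encode, dropping the first level so n values yield n-1 binary columns
encoded = pd.get_dummies(df_toy["Drivetrain"], prefix="Drivetrain", drop_first=True)
#Decompose "6speedmanual"-style strings instead of one-hot encoding them
parts = df_toy["Transmission"].str.extract(r"(\d+)speed(manual|automatic)")
df_toy["Num Gears"] = parts[0].astype(int)
df_toy["Manual Transmission"] = (parts[1] == "manual").astype(int)
df_toy["Automatic Transmission"] = (parts[1] == "automatic").astype(int)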

#HIDDEN
#Centers code cell outputs
from IPython.display import display, HTML
CSS = """
.output {
    align-items: center;
}
"""
HTML('<style>{}</style>'.format(CSS))
#HIDDEN
#Import required python3 packages
import os
import numpy as np
import pandas as pd
import nbinteract as nbi
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import scale, StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
#Read data in from processed .csv file
#Fully Encoded data: processed using data_cleaner_fe.py
cwd = os.getcwd()
data_fe = pd.read_csv(cwd+"/car_data/fe_cars.csv")
data_fe = data_fe.drop(columns = ['Unnamed: 0'])
#Semi-Encoded data: processed using data_cleaner_se.py
data_se = pd.read_csv(cwd+"/car_data/se_cars.csv")
data_se = data_se.drop(columns = ['Unnamed: 0'])

data = [data_fe, data_se]
#data = [scale(x) for x in data]

print("Fully Encoded Dataframe Dimensions: %s" % str(data_fe.shape))
print("Semi-Encoded Dataframe Dimensions: %s" % str(data_se.shape))
Fully Encoded Dataframe Dimensions: (23844, 1756)
Semi-Encoded Dataframe Dimensions: (12873, 64)
#Prepare data for regression analysis
Y_fe, Y_se = data_fe['MSRP'], data_se['MSRP'] #Y_fe is fully encoded target, Y_se is semi-encoded target

X_fe, X_se = data_fe.drop(columns = ['MSRP']), data_se.drop(columns = ['MSRP']) #X_fe is fully encoded attributes, X_se is semi-encoded attributes

X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(X_fe, Y_fe, test_size = 0.25)
X_train_se, X_test_se, y_train_se, y_test_se = train_test_split(X_se, Y_se, test_size = 0.25)

X_train, X_test = [X_train_fe, X_train_se], [X_test_fe, X_test_se]
y_train, y_test = [y_train_fe, y_train_se], [y_test_fe, y_test_se]
#Mean MSRP of the fully encoded dataset
Y_fe.mean()
38541.97114578091

3.2. Least Squares Linear Regression

For our first regression model, we used the LinearRegression.fit function from the scikit-learn Python library. This function takes a set of independent predictor vectors X and a set of target values y. Using X and y, it builds a regression model using ordinary least squares (OLS) linear regression. The model finds a best-fitting line by determining the coefficients $\beta$ of $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$, or equivalently $y_i = X_i\beta$, where $i$ indexes a particular observation instance. It does this by finding the values of $\beta$ that minimize a cost function.
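
For OLS, that cost function is the residual sum of squares:

$$\hat{\beta} = \underset{\beta}{\arg\min}\,\sum_{i=1}^{m}\left(y_i - X_i\beta\right)^2 = \underset{\beta}{\arg\min}\,\lVert y - X\beta\rVert_2^2$$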

#Model Fitting
lin = [LinearRegression().fit(X_train[0], y_train[0]), LinearRegression().fit(X_train[1], y_train[1])]
#Prediction
pred_lin = [lin[0].predict(X_test[0]), lin[1].predict(X_test[1])] 
#Metrics
mae_lin = [mean_absolute_error(y_test[0], pred_lin[0]), mean_absolute_error(y_test[1], pred_lin[1])]
rmse_lin = [np.sqrt(mean_squared_error(y_test[0], pred_lin[0])), np.sqrt(mean_squared_error(y_test[1], pred_lin[1]))]
r2_lin = [r2_score(y_test[0], pred_lin[0]), r2_score(y_test[1], pred_lin[1])]

3.3. Ridge Regression

The second regression model we used was ridge regression. To build the model, we used the Ridge.fit function from the scikit-learn Python library. Ridge regression works similarly to ordinary least squares, but it adds a penalty term to the cost function that introduces a small amount of bias in exchange for reduced variance. This helps with issues of overfitting and high multicollinearity.
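
Concretely, ridge regression minimizes the residual sum of squares plus an L2 penalty on the coefficients, whose strength is controlled by scikit-learn's alpha parameter:

$$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min}\,\lVert y - X\beta\rVert_2^2 + \alpha\,\lVert \beta\rVert_2^2$$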

#Model Fitting
#alpha = 0.1 found by GridSearchCV
ridge = [Ridge(alpha = 0.1).fit(X_train[0], y_train[0]), Ridge(alpha = 0.1).fit(X_train[1], y_train[1])]
#Prediction
pred_ridge = [ridge[0].predict(X_test[0]), ridge[1].predict(X_test[1])] 
#Metrics
mae_ridge = [mean_absolute_error(y_test[0], pred_ridge[0]), mean_absolute_error(y_test[1], pred_ridge[1])]
rmse_ridge = [np.sqrt(mean_squared_error(y_test[0], pred_ridge[0])), np.sqrt(mean_squared_error(y_test[1], pred_ridge[1]))]
r2_ridge = [r2_score(y_test[0], pred_ridge[0]), r2_score(y_test[1], pred_ridge[1])]
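
The alpha = 0.1 value noted in the code comments was found with GridSearchCV; a hypothetical version of that search (the candidate grid and scoring choice are assumptions, not necessarily those actually used) might look like:

#Hypothetical grid search over the ridge penalty strength
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train[1], y_train[1])
print(search.best_params_)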

3.4. Lasso Regression

The last regression model we used was lasso regression. This model was built with the Lasso.fit function from scikit-learn. Lasso is very similar to ridge regression, but it penalizes the coefficients in a slightly different way, using their absolute values rather than their squares. Lasso regression therefore often shrinks some $\beta$ coefficients to exactly 0, which simplifies the model.
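
In scikit-learn's formulation, lasso replaces the squared (L2) penalty with an L1 penalty and scales the residual term by the number of training samples $m$:

$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min}\,\frac{1}{2m}\,\lVert y - X\beta\rVert_2^2 + \alpha\,\lVert \beta\rVert_1$$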

#Model Fitting
#alpha = 0.1 found by GridSearchCV
lasso = [Lasso(alpha=0.1, tol = 0.1).fit(X_train[0], y_train[0]), Lasso(alpha=0.1, tol = 1).fit(X_train[1], y_train[1])]
#Prediction
pred_lasso = [lasso[0].predict(X_test[0]), lasso[1].predict(X_test[1])] 
#Metrics
mae_lasso = [mean_absolute_error(y_test[0], pred_lasso[0]), mean_absolute_error(y_test[1], pred_lasso[1])]
rmse_lasso = [np.sqrt(mean_squared_error(y_test[0], pred_lasso[0])), np.sqrt(mean_squared_error(y_test[1], pred_lasso[1]))]
r2_lasso = [r2_score(y_test[0], pred_lasso[0]), r2_score(y_test[1], pred_lasso[1])]

4. Results and Discussion

The histogram below shows the distribution of MSRP values in the fully encoded dataset, whose mean MSRP is \$38,541. The semi-encoded dataset had a very similar distribution with a mean MSRP of \$33,535. The discrepancy is caused by the differing numbers of instances in the two datasets (23,844 versus 12,873).

#HIDDEN
plt.hist(Y_fe,bins = 1000)
plt.xlabel('MSRP')
plt.ylabel('Frequency')
plt.title('MSRP Histogram');
#HIDDEN
data = [["Linear Regression", "Fully Encoded", data_fe.shape, mae_lin[0], rmse_lin[0], r2_lin[0]],
        ["Linear Regression", "Semi-Encoded", data_se.shape, mae_lin[1], rmse_lin[1], r2_lin[1]],                                                                                         ["Ridge Regression", "Fully Encoded", data_fe.shape, mae_ridge[0], rmse_ridge[0], r2_ridge[0]],
        ["Ridge Regression", "Semi-Encoded", data_se.shape, mae_ridge[1], rmse_ridge[1], r2_ridge[1]],
        ["Lasso Regression", "Fully Encoded", data_fe.shape, mae_lasso[0], rmse_lasso[0], r2_lasso[0]],
        ["Lasso Regression", "Semi-Encoded", data_se.shape, mae_lasso[1], rmse_lasso[1], r2_lasso[1]]]                                                                                                                                                                               
columns = ["Model", "Dataset", "Dataframe Dimensions", "Mean Absolute Error", "Root Mean Square Error", "R^2 value"]
pd.DataFrame(data, columns=columns)
Model Dataset Dataframe Dimensions Mean Absolute Error Root Mean Square Error R^2 value
0 Linear Regression Fully Encoded (23844, 1756) 3.906211e+09 4.023996e+10 -1.373203e+12
1 Linear Regression Semi-Encoded (12873, 64) 7.068395e+03 1.360359e+04 7.450534e-01
2 Ridge Regression Fully Encoded (23844, 1756) 4.314606e+03 9.287216e+03 9.268540e-01
3 Ridge Regression Semi-Encoded (12873, 64) 7.082414e+03 1.360344e+04 7.450589e-01
4 Lasso Regression Fully Encoded (23844, 1756) 4.866754e+03 9.409368e+03 9.249172e-01
5 Lasso Regression Semi-Encoded (12873, 64) 8.503159e+03 1.495366e+04 6.919389e-01
#HIDDEN
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 10), sharey=True)
sns.regplot(x=lin[0].predict(X_test[0]), y=y_test[0], ax=axes[0])
sns.regplot(x=lin[1].predict(X_test[1]), y=y_test[1], ax=axes[1])
axes[0].set_title('Linear Regression: Fully Encoded'), axes[1].set_title('Linear Regression: Semi-Encoded')
axes[0].set_xlabel("Predicted Price (MSRP)"), axes[1].set_xlabel("Predicted Price (MSRP)")
axes[0].set_ylabel("Actual Price (MSRP)"), axes[1].set_ylabel("Actual Price (MSRP)")
plt.show();
#HIDDEN
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 10), sharey=True)
sns.regplot(x=ridge[0].predict(X_test[0]), y=y_test[0], ax=axes[0])
sns.regplot(x=ridge[1].predict(X_test[1]), y=y_test[1], ax=axes[1])
axes[0].set_title('Ridge Regression: Fully Encoded'), axes[1].set_title('Ridge Regression: Semi-Encoded')
axes[0].set_xlabel("Predicted Price (MSRP)"), axes[1].set_xlabel("Predicted Price (MSRP)")
axes[0].set_ylabel("Actual Price (MSRP)"), axes[1].set_ylabel("Actual Price (MSRP)")
plt.show();
#HIDDEN
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 10), sharey=True)
sns.regplot(x=lasso[0].predict(X_test[0]), y=y_test[0], ax=axes[0])
sns.regplot(x=lasso[1].predict(X_test[1]), y=y_test[1], ax=axes[1])
axes[0].set_title('Lasso Regression: Fully Encoded'), axes[1].set_title('Lasso Regression: Semi-Encoded')
axes[0].set_xlabel("Predicted Price (MSRP)"), axes[1].set_xlabel("Predicted Price (MSRP)")
axes[0].set_ylabel("Actual Price (MSRP)"), axes[1].set_ylabel("Actual Price (MSRP)")
plt.show();